- Feature Name: `defined_pointer_mangling`
- Start Date: 2019-06-24
- RFC PR: [rust-lang/rfcs#0000](https://github.com/rust-lang/rfcs/pull/0000)
- Rust Issue: [rust-lang/rust#0000](https://github.com/rust-lang/rust/issues/0000)

# Summary

Define the behavior of pointer value manipulation beyond `offset` and
`align_offset`. Pointers are ripe targets for optimization by encoding
additional information into their value patterns; in present Rust, this is
undefined behavior that merely happens to not yet miscompile.

# Motivation

Pointers appear ubiquitously in programs, and have constraints on their value
sets that the integral fundamental types do not. This means that projects may
desire to encode additional information in the bits of a pointer value that are
not strictly required to address memory.

For example, alignment requirements mean that pointers to complex objects often
have two or three low bits of the address that are always zero, and thus can
have type or GC information placed in those bits and removed before dereference.
Language engines, such as most JavaScript VMs, also use [NaN-boxing] to store a
union of `f64` and `*_ Object` in a single word.
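
The low-bit arithmetic described above can be sketched on plain integer addresses. This is only an illustration: `Node`, `spare_low_bits`, `tag_address`, and `untag_address` are hypothetical names, and the sketch works on `usize` values rather than on pointers, so it says nothing about whether tagging a live pointer is defined behavior.

```rust
use std::mem::align_of;

/// Hypothetical referent type; `repr(align(8))` guarantees three spare low bits.
#[repr(align(8))]
pub struct Node(pub u64);

/// Number of low address bits that are always zero for an aligned `*const T`.
pub fn spare_low_bits<T>() -> u32 {
    align_of::<T>().trailing_zeros()
}

/// Packs `tag` into the spare low bits of `addr`.
/// Panics if `addr` is misaligned for `T` or `tag` needs more bits than are spare.
pub fn tag_address<T>(addr: usize, tag: usize) -> usize {
    let mask = align_of::<T>() - 1;
    assert_eq!(addr & mask, 0, "address is not aligned for T");
    assert!(tag <= mask, "tag does not fit in the spare bits");
    addr | tag
}

/// Splits a tagged address back into `(address, tag)`.
pub fn untag_address<T>(tagged: usize) -> (usize, usize) {
    let mask = align_of::<T>() - 1;
    (tagged & !mask, tagged & mask)
}
```

The tag must be stripped with `untag_address` before the address is used to reach memory, which is the "removed before dereference" step.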

The Rust compiler and its analysis tools treat pointers with extreme caution,
and mark as potentially undefined any manipulation of pointer bit patterns that
may cause memory unsafety or incorrectness.

This RFC defines a specific process of encoding information into pointers, and
uses the type system to ensure that encoded pointers cannot be dereferenced
until they have been decoded to their original, correct, value.

# Guide-level explanation

The compiler defines a lang-item trait, `PointerEncoding`, which may be
implemented to encode pointers into another scheme, and decode from that scheme
back into pointers. Within implementations of this trait, the compiler will
permit manipulation of pointers’ bit values without considering the pointers
invalid for their original allocation regime.

This trait is `unsafe`, as it exposes a contract that the compiler is not
guaranteed to check, but the implementor must uphold: decoding an encoded
pointer must always yield the original pointer value.

```text
for all pointer:
  pointer == pointer.encode().decode()
```

This process must always be transparent to actual memory accesses; it can only
be used to alter pointer value patterns during storage and transport.

When a project needs to encode information in a pointer, it will construct a
type to manage the encoded values, and implement the lang-item trait to encode
into, and decode from, the encoding type.

For example, if the conversion trait were:

```rust
#[lang = "pointer_encode"]
pub unsafe trait PointerEncoding<Target: ?Sized>: Sized {
    fn encode(ptr: *const Target) -> Self;
    fn decode(self) -> *const Target;
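    fn encode_mut(ptr: *mut Target) -> Self {
        // `encode` is an associated function, not a method on raw pointers.
        <Self as PointerEncoding<Target>>::encode(ptr as *const Target)
    }
    fn decode_mut(self) -> *mut Target {
        <Self as PointerEncoding<Target>>::decode(self) as *mut Target
    }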
}
```

then an implementor would select an encoding scheme and implement
`PointerEncoding` to bridge the encoding type and Rust pointers.

```rust
pub struct JsObject;
pub struct NanBox {
    inner: f64,
}

impl NanBox {
    const NAN_MASK: u64 = 0b0111_1111_1111 << 52;
}

unsafe impl PointerEncoding<JsObject> for NanBox {
    fn encode(ptr: *const JsObject) -> Self {
        assert_eq!(ptr as usize as u64 & Self::NAN_MASK, 0);
        Self {
            inner: f64::from_bits(ptr as usize as u64 | Self::NAN_MASK),
        }
    }

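    fn decode(self) -> *const JsObject {
        let bits = self.inner.to_bits();
        // Check the exponent bits rather than `is_nan()`: a null pointer
        // encodes to an all-zero mantissa, which is an infinity, not a NaN.
        assert_eq!(bits & Self::NAN_MASK, Self::NAN_MASK);
        (bits & !Self::NAN_MASK) as usize as *const JsObject
    }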
}
```

This example implements NaN-boxing of pointers that are 52 bits or narrower, by
storing the pointer address in the mantissa of a NaN `f64`. The only way for a
Rust program to validly NaN-box a pointer is by calling
`<NanBox as PointerEncoding<JsObject>>::encode` on pointers to `JsObject`. The
type system disallows encoding any other pointer type, and the Miri evaluator
disallows mutating the bit patterns of bare pointers anywhere outside of this
trait’s implementations.

This trait can be implemented generically if the encoding scheme is agnostic to
the type of the referent, or implemented specifically if the project only
mangles certain pointers.
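
As a minimal sketch of the generic case, the following uses a local stand-in for the proposed trait (the real one would be a lang item, which cannot be declared in ordinary code) and a hypothetical `XorEncoded` scheme that only looks at the address. `XorEncoded`, `KEY`, and `round_trip` are illustrative names, not part of the proposal.

```rust
/// Local stand-in for the proposed trait; assumed shape, without `#[lang]`.
pub unsafe trait PointerEncoding<Target: ?Sized>: Sized {
    fn encode(ptr: *const Target) -> Self;
    fn decode(self) -> *const Target;
}

/// Hypothetical encoding: the address XORed with a fixed key.
pub struct XorEncoded(usize);

impl XorEncoded {
    /// Alternating bit pattern (0b0101...), valid for any pointer width.
    const KEY: usize = usize::MAX / 3;
}

// One blanket impl covers every sized referent type, because the scheme
// never inspects `T`. XOR with a constant is its own inverse, so `decode`
// exactly undoes `encode`.
unsafe impl<T> PointerEncoding<T> for XorEncoded {
    fn encode(ptr: *const T) -> Self {
        XorEncoded((ptr as usize) ^ Self::KEY)
    }

    fn decode(self) -> *const T {
        (self.0 ^ Self::KEY) as *const T
    }
}

/// Encodes and decodes a pointer to `value`, reporting whether it survived.
pub fn round_trip<T>(value: &T) -> bool {
    let original = value as *const T;
    let encoded = <XorEncoded as PointerEncoding<T>>::encode(original);
    let decoded = <XorEncoded as PointerEncoding<T>>::decode(encoded);
    decoded == original
}
```

A project that only mangles certain pointers would instead write one `unsafe impl` per referent type, as the `NanBox` example does for `JsObject`.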

Once a pointer is encoded, the program must not modify the encoded instance in
any way that `decode` cannot reverse. At the site of each
`PointerEncoding::decode` call, the decoded pointer must always be exactly equal
to the initial pointer that was used to create the encoded instance.

It is important to note that this trait does not support modifying the program’s
allocation regime or memory context. It is only used to encode in-flight
information in pointer values. It is equivalent to defining a trait that
converts integer values between 2’s-complement and sign/magnitude encodings. The
machine must always work on the original 2’s-complement encoding, but the
program may choose to carry integers in sign/magnitude form while they are not
being used for arithmetic.
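
The integer analogy can be made concrete with a short sketch. `SignMagnitude` and `add` are hypothetical names. Unlike the proposed pointer trait, this encoding is not total: `i32::MIN` has no 32-bit sign/magnitude form, so `encode` returns `None` for it.

```rust
/// Sign/magnitude storage form of an `i32`: bit 31 is the sign,
/// bits 0..=30 hold the magnitude.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SignMagnitude(u32);

const SIGN_BIT: u32 = 1 << 31;

impl SignMagnitude {
    /// Returns `None` for `i32::MIN`, whose magnitude needs all 32 bits.
    pub fn encode(value: i32) -> Option<Self> {
        if value == i32::MIN {
            return None;
        }
        let sign = if value < 0 { SIGN_BIT } else { 0 };
        Some(SignMagnitude(sign | value.unsigned_abs()))
    }

    /// Recovers the original two's-complement value.
    pub fn decode(self) -> i32 {
        let magnitude = (self.0 & !SIGN_BIT) as i32;
        if self.0 & SIGN_BIT != 0 {
            -magnitude
        } else {
            magnitude
        }
    }
}

/// Arithmetic always happens on the decoded two's-complement values;
/// the sign/magnitude form is only used for storage and transport.
pub fn add(a: SignMagnitude, b: SignMagnitude) -> Option<SignMagnitude> {
    SignMagnitude::encode(a.decode().checked_add(b.decode())?)
}
```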

# Reference-level explanation

This RFC defines a lang-item trait that the compiler and Miri will allow to
manipulate pointer *representations* without considering those manipulations to
modify the memory model of the program. The trait must be `unsafe`, as it
exposes the memory model to user code whose soundness the compiler may not be
able to verify. The additional contract imposed by this trait is that it must be
total over all input pointers, and reversible such that decoding an encoded
pointer always produces the original pointer.

Since Rust does not allow exposing mutability as a generic type parameter, this
trait *likely* offers `_mut` variants to correctly add and remove mutability
separately from the bit pattern manipulations. It remains undefined behavior to
incorrectly add mutability to an immutable pointer.
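
A minimal sketch of how the `_mut` variants might be used follows. It assumes a local stand-in for the trait (no `#[lang]` attribute) and a hypothetical `Rotated` encoding; `bump_through_encoding` is an illustrative helper, not part of the proposal.

```rust
/// Local stand-in for the proposed trait, including the provided `_mut` methods.
pub unsafe trait PointerEncoding<Target: ?Sized>: Sized {
    fn encode(ptr: *const Target) -> Self;
    fn decode(self) -> *const Target;

    fn encode_mut(ptr: *mut Target) -> Self {
        <Self as PointerEncoding<Target>>::encode(ptr as *const Target)
    }
    fn decode_mut(self) -> *mut Target {
        <Self as PointerEncoding<Target>>::decode(self) as *mut Target
    }
}

/// Hypothetical encoding: the address rotated left by eight bits.
pub struct Rotated(usize);

unsafe impl PointerEncoding<u32> for Rotated {
    fn encode(ptr: *const u32) -> Self {
        Rotated((ptr as usize).rotate_left(8))
    }
    fn decode(self) -> *const u32 {
        self.0.rotate_right(8) as *const u32
    }
}

/// Increments `counter` through a pointer that was encoded and decoded.
pub fn bump_through_encoding(counter: &mut u32) {
    // The pointer starts out mutable, so recovering it with `decode_mut`
    // does not add mutability that was never there.
    let stored = Rotated::encode_mut(counter as *mut u32);
    let ptr: *mut u32 = stored.decode_mut();
    // SAFETY: `ptr` has the same address as `counter`, which is live and
    // exclusively borrowed for the duration of this function.
    unsafe { *ptr += 1 };
}
```

Calling `decode_mut` on a value produced by `encode` from a pointer to immutable data would be the incorrect case described above.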

The exact trait definition is open to bikeshedding. The current proposal is, as
outlined above:

```rust
#[lang = "pointer_encode"]
```

The trait is marked as a lang item so that the compiler and Miri know to allow
it to manipulate pointer values without modifying the memory model.

```rust
pub unsafe trait PointerEncoding<Target: ?Sized>: Sized {
```

Each implementation chooses the set of referent types whose pointers it can
encode. The only default bound is that the referent type may be unsized;
narrower constraints are up to the specific implementor.

```rust
    fn encode(ptr: *const Target) -> Self;
    fn decode(self) -> *const Target;
```

These functions *must* be provided by each implementor. They translate a pointer
to its encoded form and back. The `decode` function must be the inverse of the
`encode` function, so for any pointer value `p`, `decode(encode(p))` in an
implementation must always return `p`.

```rust
    fn encode_mut(ptr: *mut Target) -> Self {
        <Self as PointerEncoding<Target>>::encode(ptr as *const Target)
    }
    fn decode_mut(self) -> *mut Target {
        <Self as PointerEncoding<Target>>::decode(self) as *mut Target
    }
}
```

These functions provide named, checkable, mutability control for the encoding
mechanism.

This trait is added to the `core::ptr` module, and re-exported by `std::ptr`.

The compiler and Miri are modified to allow concrete implementation types to be
treated as pointer-like in the evaluation model, and track allocation
information through encodings. They also permit these implementations to modify
the bit patterns of pointers, and do not consider these modifications to change
the memory model.

Miri’s evaluation engine MAY, and probably SHOULD, check that encode/decode
operations are correctly reversible. If Miri can dynamically compile the
implementations, it may feed sample pointers through them and check that each
one round-trips unchanged; this is less complex and less formal than a symbolic
analysis of the implementations, but it is likely an easier way to demonstrate
the correctness of an individual implementation.
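
The sampling approach can be sketched in ordinary Rust, outside of Miri. Everything here is hypothetical: the trait is a local stand-in, `Lossless` and `DropsLowBits` are illustrative encodings, and `first_round_trip_failure` is not an existing Miri facility. `DropsLowBits` deliberately violates the contract to show what such a check would catch.

```rust
/// Local stand-in for the proposed trait; assumed shape, without `#[lang]`.
pub unsafe trait PointerEncoding<Target: ?Sized>: Sized {
    fn encode(ptr: *const Target) -> Self;
    fn decode(self) -> *const Target;
}

/// Reversible encoding: rotates the address by eight bits.
pub struct Lossless(usize);

unsafe impl PointerEncoding<u8> for Lossless {
    fn encode(ptr: *const u8) -> Self {
        Lossless((ptr as usize).rotate_left(8))
    }
    fn decode(self) -> *const u8 {
        self.0.rotate_right(8) as *const u8
    }
}

/// Deliberately incorrect encoding: discards the three low address bits.
pub struct DropsLowBits(usize);

unsafe impl PointerEncoding<u8> for DropsLowBits {
    fn encode(ptr: *const u8) -> Self {
        DropsLowBits((ptr as usize) >> 3)
    }
    fn decode(self) -> *const u8 {
        (self.0 << 3) as *const u8
    }
}

/// Feeds each sample address through `E` and returns the first one that
/// does not decode back to itself. The pointers are never dereferenced.
pub fn first_round_trip_failure<E: PointerEncoding<u8>>(samples: &[usize]) -> Option<usize> {
    samples.iter().copied().find(|&addr| {
        let ptr = addr as *const u8;
        let encoded = <E as PointerEncoding<u8>>::encode(ptr);
        <E as PointerEncoding<u8>>::decode(encoded) != ptr
    })
}
```

A passing sweep only shows that the sampled addresses round-trip; it does not prove the implementation correct for every pointer.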

As stated above, it remains undefined behavior for an incorrect implementation
to use the encoding process to cause a pointer to escape its allocation block
and cross into another allocation. It may be worthwhile to restrict the trait
functions to be `const fn`.

# Drawbacks

This allows users to pollute the pointer and reference model of a program by
changing the addressed value of a pointer. If `decode` does not perfectly undo
the `encode` function, then the result of `p.encode().decode()` becomes invalid
and the program enters an incoherent state.

# Rationale and alternatives

- Why is this design the best in the space of possible designs?

  It provides a single, dedicated place in the compiler for a program to specify
  pointer encodings, and enables future work in the compiler to check that the
  encoding process is correctly used and does not corrupt the program’s memory
  model.

- What other designs have been considered and what is the rationale for not
  choosing them?

  I don’t know.

- What is the impact of not doing this?

  Well-defined pointer encoding remains absent from the compiler, and future
  changes to Miri may cause undefined-behavior propagation to poison pointer
  encodings. This prevents writing performance-equivalent Rust counterparts to
  existing projects that use pointer encoding, such as language runtimes. As
  these tend to be the large, safety-critical projects for which Rust is
  marketed, Rust should offer well-built tools to perform the work they need to
  do.

# Prior art

A plethora of pointer-encoding schemes exist. NaN-boxing, Ruby’s `Fixnum`, OCaml
GC-tagging, and surely more examples of pointer/data unification demonstrate the
performance benefits of localizing immediate data with indirect data in large
programs. The fact that pointer encoding is used in widely-deployed,
user-facing, and highly visible projects such as JavaScript engines indicates
that it is both a useful thing to do and something that can be feasibly hardened
against malicious attack.

# Unresolved questions

- What parts of the design do you expect to resolve through the RFC process
  before this gets merged?

  The signature of the encoding trait is entirely up to discussion. Should it
  handle references differently from raw pointers? Should the trait functions
  be fallible for pointers that the encoding scheme cannot handle?

- What parts of the design do you expect to resolve through the implementation
  of this feature before stabilization?

  If and how the compiler and Miri demonstrate correctness of an encoding
  implementation.

- What related issues do you consider out of scope for this RFC that could be
  addressed in the future independently of the solution that comes out of this
  RFC?

  Pointer value validity. This RFC is strictly concerned with encoding and
  decoding pointers; the manner in which the compiler and Miri check that
  decoded pointers still make sense in the program’s memory model is left to
  those projects.
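
To make the fallibility question above concrete, one possible shape is sketched here. `TryPointerEncoding` and `try_encode` are hypothetical names and are not part of the current proposal; the `NanBox` layout is the one from the guide-level example, with the `assert_eq!` replaced by an `Option` return.

```rust
/// Hypothetical fallible variant of the trait: `try_encode` may reject
/// pointers that the scheme cannot represent.
pub unsafe trait TryPointerEncoding<Target: ?Sized>: Sized {
    fn try_encode(ptr: *const Target) -> Option<Self>;
    fn decode(self) -> *const Target;
}

pub struct JsObject;

pub struct NanBox {
    inner: f64,
}

impl NanBox {
    /// The eleven exponent bits of an `f64`.
    const NAN_MASK: u64 = 0b0111_1111_1111 << 52;
}

unsafe impl TryPointerEncoding<JsObject> for NanBox {
    fn try_encode(ptr: *const JsObject) -> Option<Self> {
        let addr = ptr as usize as u64;
        // Reject addresses that would collide with the exponent bits
        // instead of panicking.
        if addr & Self::NAN_MASK != 0 {
            return None;
        }
        Some(NanBox {
            inner: f64::from_bits(addr | Self::NAN_MASK),
        })
    }

    fn decode(self) -> *const JsObject {
        (self.inner.to_bits() & !Self::NAN_MASK) as usize as *const JsObject
    }
}
```

Under this shape the round-trip contract would apply only to pointers for which `try_encode` returns `Some`.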

# Future possibilities

It is undefined behavior for a `PointerEncoding` implementation to produce
`pointer != decode(encode(pointer))`, and Miri is permitted to begin rejecting
programs at any time if it can demonstrate that an implementation does so.

[NaN-boxing]: https://softwareengineering.stackexchange.com/questions/185406/what-is-the-purpose-of-nan-boxing
